Main goals of the session
To put into practice the concepts of species divergence and evolutionary rates that you will learn in the theory lessons, you will analyse nucleotide divergence in two genes (obp83b and rpL32) with different biological functions in five closely related species of the genus Drosophila.
Throughout the document you will see different icons whose meaning is:
: Additional or useful information
: Practical exercise
: Hint to solve an exercise or to do a task
: Slot to answer a question
: Problem or task to be solved
You can use either the R console in the terminal or
RStudio for this practice. If you don’t have R
installed, you can download the appropriate package for your system here. To install
RStudio, go to this page and
follow the instructions.
Before starting the exercises, you will need to install some
necessary libraries for downloading sequences from the GenBank
database, performing multiple sequence alignments, building phylogenetic
trees, and calibrating these trees. Open the R console in
the terminal (or in RStudio) and type:
packages <- c("reutils", "ape", "seqinr", "phangorn", "phylotools")
install.packages(setdiff(packages, rownames(installed.packages())))
You can use the efetch() function implemented in the
reutils package to retrieve sequences from databases. These
functions allow you to connect to GenBank and download the
sequences directly to your computer in FASTA format. You will need the
GenBank identifier of either the gene region or the complete
genome sequence of the species and the coordinates of the desired gene
region in that genome. Note that for some species the gene is
encoded in the reverse complement sequence relative to the reference in
the database. In these cases you will need to specify the strand in
efetch. In the example below
Let’s see how to retrieve the coding regions (CDS) of the obp83b gene in different species. Table 1 lists the identifiers and coordinates of this gene region in the five species:
Table 1. Identifiers and genomic coordinates of the obp83b gene in different Drosophila species.
| Species | ID | Coordinates in the genome | Strand |
|---|---|---|---|
| D. melanogaster | NT_033777 | 5976177-5976817 | 2 |
| D. simulans | NC_052523 | 2470507-2471257 | 2 |
| D. erecta | NW_020825194 | 26500029-26500718 | 1 |
| D. elegans | NW_024545863 | 12746083-12746802 | 2 |
| D. pseudopobscura | NC_046679 | 16336491-16337172 | 1 |
GenBank_ identifiers can be obtained from the literature (articles) by starting a search with R libraries such as
rentrez, directly from the web page of this database, usingesearch()fromreutils, or using other programmes and more advanced bioinformatics tools. .
To download and write a file in FASTA format with the CDS sequence of
the obp83b gene in these species, you can use the
efetch() function of reutils.
library(reutils)
## create a new working directory
dir.create("divergence")
## example for D. subobscura:
efetch("NC_046679", db="nucleotide", rettype="fasta_cds_na", retmode="text", strand=1, seqstart=16336491, seqstop=16337172, outfile="./divergence/Dpse_obp83b_cds.fasta")
db, specifies the database;rettype, specifies the subsequence to be extracted (=CDS) and the format (=FASTA);retmode, specifies the file format (=plain text);strand, 1: positive strand, 2: negative (reverse complement with respect to the reference in the database);seqstartandseqstop, are the coordinates of the sequence to be downloaded (required if we are extracting gene regions from the complete genome sequence set).
read.dna() and
write.dna() functions in the ape package.To identify the nucleotide substitutions that have occurred in the
coding region of the opb83b gene during the divergence of the
four species, one must first predict which positions are homologous
(i.e. derived from a common ancestor). There are many algorithms and
programs for constructing multiple sequence alignments of DNA sequences.
For purely practical reasons, in this session you will use
ape, an R package that implements some of
these algorithms.
library(ape)
## read the sequences
myseqs<-read.dna("divergence/obp83b_cds.fasta", format="fasta")
## visualize the sequences
myseqs
## multiple sequence alignment
myalign <- clustal(myseqs)
## partial visualization of the generated alignment
myalign
## write a file (maintaining the gaps introduced in the msa step)
write.dna(myalign, file="divergence/obp83b_cds_aln.fasta", format="fasta")
Coding regions are special for molecular evolutionary analyses because they contain two different types of positions in terms of the functional consequences of mutations, namely synonymous and nonsynonymous positions. Synonymous changes are those that do not alter the amino acid encoded by the codon, whereas nonsynonymous changes cause an amino acid substitution in the protein. Consequently, synonymous sites are sites where if a change occurs, it will be a synonymous change. The same applies to nonsynonymous sites. Studying and comparing the variability of these two types of sites is very informative about selective constraints and the functional consequences of mutations.
Divergence is estimated as the number of substitutions per site (this allows the divergence of regions of different lengths to be compared). Therefore, to estimate synonymous and nonsynonymous divergence, we first need to calculate the number of sites of each type in the coding regions of interest. The most commonly used method to estimate these sites is the Nei-Gojobori (N-G) method (Figure 1). The N-G approach consists of estimating the proportion of changes that would be synonymous and nonsynonymous in a given codon after considering all nine possible nucleotide changes (based on the genetic code).
Figure 1. Nei-Gojobori method for estimating the number of synonymous and nonsynonymous sites in a codon
Here are some examples of functions for working with changes and synonymous sites:
library(seqinr)
translate<-seqinr::translate
## function to translate a codon:
dna_to_aa <- function(codon) {
dna<-s2c(codon)
aa<-translate(dna, frame = 0, sens = "F", numcode = 1, NAstring = "X", ambiguous = FALSE)
return (aa)
}
## function to determine if a change between two codons is synonym
is_synonymous<-function(codon1, codon2) {
return (dna_to_aa(codon1) == dna_to_aa(codon2))
}
## function to estimate the number of synonymous sites in a codon:
synpos<-function(codon) {
pos<-s2c(codon)
for (i in 1:length(pos)) {
base = pos[i]
BASES = c('A', 'G', 'T', 'C')
other_bases = BASES[BASES!=base]
syn=0
for (new_base in other_bases) {
new_pos=c(pos[0:(i-1)], new_base, pos[-(1:i)])
new_codon=paste(new_pos, collapse="")
s<-(is_synonymous(codon, new_codon))
syn <- syn + length(s[s==TRUE])
}
synp <- syn/3
}
return (synp)
}
Calculate the number of synonymous and nonsynonymous changes and sites in this coding sequence alignment using the functions above:
Seq 1 ATGATGCAGAGTCTGTAA Seq 2 ATGAGGCACAGTCTGTAA
The number of sites
of each class in each sequence is the sum over all codons in the
sequence, and the total number of sites of each class in the alignment
is the average over all sequences. You can apply the functions
separately for each codon, or consider a loop integrating functions such
as s2c() and splitseq() from the
seqinr package
Fortunately, you will not need to repeat this calculation for all the
gene sequences we will be studying in this lab. There are R
functions that can be used to calculate the synonymous and nonsynonymous
divergence of a coding region directly from the CDS alignment. The
seqinr package implements the KaKs() function,
which allows the calculation of these divergences. Before proceeding
with the calculations, we will change the sequence names in the fasta
file to make it easier to identify the species in the results matrix
(now the identifiers are very long and some of them do not contain the
species name):
library(phylotools)
old_names<-get.fasta.name("divergence/obp83b_cds_aln.fasta")
## new names must be in the same order than old names...
new_names<-c("Dmel_obp83b",
"Dsim_obp83b",
"Dere_obp83b",
"Dele_obp83b",
"Dpse_obp83b")
ref2 <- data.frame(old_names, new_names)
rename.fasta(infile = "divergence/obp83b_cds_aln.fasta", ref_table = ref2, outfile = "divergence/obp83b_cds_renamed_aln.fasta")
cds <- read.alignment("divergence/obp83b_cds_renamed_aln.fasta",format="fasta")
## matrix with synonymous divergences (Ks)
kaks(cds)$ks
## matrix with nonsynonymous divergences (Ka)
kaks(cds)$ka
In addition to the CDS, noncoding sequences are also found in eukaryotic genes. These sequences correspond to introns, exonic untranslated regions (5’ or 3’) or 5’ proximal regions, often containing promoters and regulatory elements. Here, you will analyse the divergence in the 5’ proximal region and the introns of the obp83b gene in the same five species.
First, you will obtain noncoding sequences of this gene for all
species. To do this, you need to know exon coordinates. Again, you can
use reutils functions to read the GenBank entries in your
R terminal:
## example for D. pseudoobscura:
x<-efetch("NC_046679", db="nucleotide", rettype="gb", retmode="text", strand=1, seqstart=16336491, seqstop=16337172)
con <- content(x, as = "textConnection")
readLines(con)
The
readLines()function prints the entry in GenBank format. Locate the feature “CDS” to find the coordinates of the coding region of the gene obp83b in this species.
Once you know the coordinates of CDS, you can extract 5’ and intron sequences:
## example for D.pseudobscura
## download the complete gene region
efetch("NC_046679", db="nucleotide", rettype="fasta", retmode="text", strand=1, seqstart=16336491, seqstop=16337172, outfile="divergence/Dpse_opb83.fasta")
## create an object with the sequence of the complete gene region
seq<-read.dna("divergence/Dpse_obp83b.fasta", format="fasta")
## create a new sequence with only intron sequences
## CDS = join(54..128,186..261,317..594)
## 5' region = 1-53
## intron 1 = 129-185
## intron2 = 262-316
seqnc<-seq
seqnc=seqnc[,c(1:53,129:185,262:316)]
## change species label
seqnc<-updateLabel(seqnc, labels(seqnc), "Dpse_obp83b")
## write a fasta file with noncoding sequences of the opb83b gene of ALL SPECIES (with append=TRUE, if your run this script for all species, all sequences will be added to the same file; remember to change the ids in each case)
write.dna(seqnc, format="fasta", file="divergence/obp83b_nc.fasta", append=TRUE)
Now we can align the non-coding sequences and estimate the number of
substitutions per site (KNC) of the obp83b
gene. For nucleotide divergences (without any distinction of positions,
just total divergence) you can use the function dist.dna()
from the package ape.
## read the noncoding sequences
myseqs<-read.dna("divergence/obp83b_nc.fasta", format="fasta")
## visualize the sequences
myseqs
## multiple sequence alignment
myalign <- clustal(myseqs)
## partial visualization of the generated alignment
myalign
## write the alignment to file
write.dna(myalign, file="divergence/obp83b_nc_aln.fasta", format="fasta")
## read the noncoding alignment
nc <- read.dna("divergence/obp83b_nc_aln.fasta", format="fasta")
## noncoding divergences in the opb83b gene
dist.dna(nc)
Fill empty cells in the following tables with the results obtained in the different divergence analyses and answer the questions:
Table 1. Divergence in the coding region of the gene opb83b(Ka and Ks values are above and below the diagonal, respectively)
| Dmel | Dsim | Dyak | Dana | Dpse | |
|---|---|---|---|---|---|
| Dmel | ——— | ||||
| Dsim | ——— | ||||
| Dyak | ——— | ||||
| Dana | ——— | ||||
| Dpse | ——— |
Divergence in the non-coding regions of the gene opb83b (Fill in the table with KNC values above the diagonal)
| Dmel | Dsim | Dyak | Dana | Dpse | |
|---|---|---|---|---|---|
| Dmel | ——— | ||||
| Dsim | ——— | ——— | |||
| Dyak | ——— | ——— | ——— | ||
| Dana | ——— | ——— | ——— | ——— | ——— |
| Dpse | ——— | ——— | ——— | ——— | ——— |
Questions:
1. Is the obp83b gene evolving rapidly? And the Obp83b protein?
2. In which of the analysed positions (synonymous, nonsynonymous or noncoding) are evolutionary distances higher? What is the reason for this, in your opinion?
3. Could you consider some of the observed changes as selectively neutral? Which ones?
4. Which species would share a more recent common ancestor? In which evolutionary distances (synonymous, nonsynonymous or noncoding) should we focus to know that?
Some of the most direct applications of nucleotide divergence
estimates are the reconstruction of phylogenetic relationships and the
inference of evolutionary rates. To reconstruct the phylogenetic
relationships among the five Drosophila species using the
information form the obp83b gene, you can use the functions
from ape and phangorn packages.
As an example, the following code reconstructs the phylogenetic relationships between the five species based on synonymous divergences. It uses the synonymous divergence matrix estimated above and the Neighbor Joining algorithm
library(phangorn)
cds <- read.alignment("divergence/obp83b_cds_renamed_aln.fasta",format="fasta")
## NJ tree
syntree <- NJ(kaks(cds)$ka)
plot(syntree)
write.tree(syntree, file="divergence/obp83b_Ks.tree")
By definition, a Neighbor Joining tree is an unrooted tree. There is no root node that is the common ancestor of all species. With an unrooted tree, we can tell the phylogenetic relationships between species (i.e. which species have a more recent common ancestor compared to the other species), but not the direction of evolution (i.e. which species arose more recently). In addition, branch lengths represent substitutions per site, in this case the number of substitutions per site. Sometimes, as in our example, we know which branch contains the root of the species under study (see Figure 2).
Figure 2. Phylogenetic relationships among the species of the genus Drosophila(modified from Mol Ecol Resour. 2022;22:1559–1581)
To calculate the evolutionary rate of the Obp83b protein, we need an ultrametic tree, i.e. a rooted tree in which the branch lengths represent time (not substitutions) and at least one calibration point (i.e. a dated node).
## read the tree to a phylo class object
tree<-read.tree(file="divergence/obp83b_Ks.tree")
## plot the unroted tree
plot(tree, main="Obp gene tree (number of nonsynonymous substitutions per site)")
## root the tree using Drosophila pseudoobscura as the outgroup
rooted<-root(tree, "Dpse_obp83b")
plot(rooted)
add.scale.bar(ask=TRUE)
## set calibration
## select the node corresponding to the ancestor of D. melanogater, D. simulans and D. erecta, and specify a range of 6.1-18.9 MYA
cal <- makeChronosCalib(rooted, interactive = TRUE)
## fit the model (penalized likelihood)
chr<-chronos(rooted, model="clock", calibration = cal)
# plot de calibrated (ultrametric) tree
plot(chr, main="calibrated tree (million years)")
axisPhylo()
tiplabels()
nodelabels()
chr$edge
chr$edge.length
The
tiplabels()andnodelabels()functions will tell you the id of each node in the tree (they are indicated in the tree). Withchr$edgeandchr$edge.lengthyou can find the correspondence between edge numbers and edge lengths, and thus the estimated date in million years of each node.
You now have all the parameters you need to calculate the evolutionary rate of the Obp83b protein in Drosophila.
Calculate the evolutionary rate (r) of the Obp83b protein in Drosophila using the divergences and times estimated in this practice. Think carefully about which divergence to use for this calculation and apply the formula.
Table 2. Identifiers and genomic coordinates of the rpL32 gene in different Drosophila species.
| Species | ID | Coordinates in the genome | Strand |
|---|---|---|---|
| D. melanogaster | NT_033777 | 30045229-30046161 | 2 |
| D. simulans | NC_052523 | 26165102-26166121 | 2 |
| D. erecta | NW_020825194 | 2228933..2229932 | 1 |
| D. serrata | NW_018367383 | 6093667..6094458 | 2 |
| D. pseudopobscura | NC_046679 | 811678-812625 | 1 |
Using the data in Table 2, calculate the rate of evolution of the protein RpL32 in Drosophila and answer the following questions. Note that for this gene D. elegans is replaced by D. serrata.
5. Which is the evolutionary rate of the RpL32 protein?
6. Are the evolutionary rates of the two proteins studied in this practice different?
7. Can you provide a molecular evolutionary explanation for the answer to question 6?
8. Which of the other positions examined here (i.e. nonsynonymous, noncoding, the full cds…) could have been used to reconstruct the phylogenetic tree prior to calibration?
9. Looking at your results and the tree in Figure 2, do you think that the fact that some of the species are not the same in both analyses (D. elegans in obp83b and D. serrata in rpL32) influences the results? Discuss
Submit this document with your
answers (or just a document with your answers to the questions in any
readable format, e.g. word, pdf, plain text…), and the R
code used to complete the final assignment to
AULAESCI
Deadline: June 28, 2024
Doubts? mailto:alejandro.sanchez@prof.esci.upf.edu